implement file splitting functionality and enhance documentation by Zhangg7723 · Pull Request #51 · oceanbase/powerrag

Zhangg7723 · 2026-01-28T08:23:59Z

Summary

- Added `split_file` and `split_file_upload` methods to support file chunking via local paths, URLs, and uploads.
- Updated README.md to include detailed examples for text and file splitting methods.
- Enhanced error handling for unsupported parsers and invalid file inputs.
- Introduced tests for file splitting functionalities to ensure reliability.
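
The error handling for unsupported parsers can be pictured roughly as follows. This is a minimal sketch and not the code from this PR: `TEXT_PARSERS`, `FILE_PARSERS`, and `validate_parser_id` are hypothetical names, and the parser IDs are copied from the documentation updates quoted further down this page.

```python
# Hypothetical sketch: parser IDs come from the docs in this PR,
# but the function and constant names are illustrative assumptions.
TEXT_PARSERS = frozenset({"title", "regex", "smart"})
FILE_PARSERS = TEXT_PARSERS | frozenset({
    "naive", "qa", "book", "laws", "paper", "manual", "presentation",
    "table", "resume", "picture", "one", "email",
})


def validate_parser_id(parser_id: str, mode: str) -> str:
    """Return parser_id if it is allowed for the given mode ('text' or 'file')."""
    if mode == "text":
        allowed = TEXT_PARSERS
    elif mode == "file":
        allowed = FILE_PARSERS
    else:
        raise ValueError(f"Unknown mode {mode!r}; expected 'text' or 'file'.")

    if parser_id not in allowed:
        # Point the caller at file splitting when the parser exists there.
        hint = ""
        if mode == "text" and parser_id in FILE_PARSERS:
            hint = " Use split_file or split_file_upload for this parser."
        raise ValueError(
            f"Unsupported parser_id {parser_id!r} for {mode} splitting; "
            f"expected one of {sorted(allowed)}.{hint}"
        )
    return parser_id
```

Failing early with the list of accepted IDs and a pointer to the file splitting methods is one way to make the "unsupported parser" error actionable for SDK users.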

Solution Description


dosubot · 2026-01-28T08:24:14Z

Documentation Updates

2 document(s) were updated by changes in this PR:

Markdown Processing and Chunking


@@ -1,12 +1,21 @@
-PowerRAG provides a robust system for processing markdown documents, supporting multiple chunking strategies to optimize retrieval-augmented generation (RAG) workflows. The system is designed to preserve document structure, handle complex markdown elements, and manage chunk sizes for downstream tasks such as embedding or LLM input.
+PowerRAG provides a robust system for processing documents, supporting multiple chunking strategies to optimize retrieval-augmented generation (RAG) workflows. The system is designed to preserve document structure, handle complex markdown elements, and manage chunk sizes for downstream tasks such as embedding or LLM input.
 
 ### Architecture and Chunking Strategies
 
-The core service for markdown chunking is `PowerRAGSplitService`, which exposes a unified interface for splitting text using different strategies, selected via the `parser_id` parameter. Supported strategies include:
+The core service is `PowerRAGSplitService`, which exposes two primary capabilities:
 
-- **Title-based chunking**: Splits content at markdown headers of a specified level, preserving section boundaries and returning both chunk content and associated titles.
-- **Regex-based chunking**: Splits text using a configurable regex pattern, then merges or further splits chunks based on token thresholds.
-- **Smart chunking**: Uses an AST-based approach to parse markdown structure, intelligently chunking by headings, containers (lists, tables), and token counts.
+1. **Text Splitting** (`split_text`): For markdown/text content using three specialized parsers
+2. **File Splitting** (`split_file`, `split_file_upload`): For files using all available ParserType methods
+
+#### Text Splitting
+
+The `split_text` method supports three specialized parsers for markdown and text content, selected via the `parser_id` parameter:
+
+- **Title-based chunking** (`title`): Splits content at markdown headers of a specified level, preserving section boundaries and returning both chunk content and associated titles.
+- **Regex-based chunking** (`regex`): Splits text using a configurable regex pattern, then merges or further splits chunks based on token thresholds.
+- **Smart chunking** (`smart`): Uses an AST-based approach to parse markdown structure, intelligently chunking by headings, containers (lists, tables), and token counts.
+
+**Note**: Only these three parsers are supported for `split_text`. For other parsers (such as `naive`, `book`, `qa`, `paper`, etc.), use the file splitting methods.
 
 Example usage:
 ```python
@@ -31,6 +40,71 @@
 
 Smart chunking parses the markdown document into an abstract syntax tree (AST) using `MarkdownIt`. It recursively processes AST nodes, treating headings as chunk boundaries and preserving containers such as lists, tables, and code blocks. Chunks are merged or split based on token counts and document structure. Large chunks are split first by headings, then by newlines, ensuring each chunk is close to the target token size and titles are preserved as prefixes.
 
+#### File Splitting
+
+The `split_file` and `split_file_upload` methods support all available ParserType methods, providing comprehensive file chunking capabilities. These methods work with local files, file URLs, and file uploads, supporting various document types including PDFs, Office documents, images, and HTML.
+
+**Supported ParserType Methods**:
+- **Basic parsers**: `naive`, `title`, `regex`, `smart`
+- **Specialized parsers**: `qa`, `book`, `laws`, `paper`, `manual`, `presentation`
+- **Format-specific parsers**: `table`, `resume`, `picture`, `one`, `email`
+
+The file splitting methods internally initialize a file chunker factory (`_init_file_chunker_factory`) that maps each ParserType to its corresponding chunking module from the `rag/app` and `powerrag/app` packages.
+
+**Usage Examples**:
+
+Using a local file path:
+```python
+service = PowerRAGSplitService()
+result = service.split_file(
+    filename="/path/to/document.pdf",
+    parser_id="book",
+    config={"chunk_token_num": 512, "delimiter": "\n。.；;！!？？"}
+)
+```
+
+Using a file URL:
+```python
+result = service.split_file(
+    filename="https://example.com/doc.pdf",
+    binary=None,  # Binary will be downloaded
+    parser_id="naive",
+    config={
+        "chunk_token_num": 256,
+        "max_file_size": 128 * 1024 * 1024,  # 128MB
+        "download_timeout": 300,  # 5 minutes
+        "head_request_timeout": 30  # 30 seconds
+    }
+)
+```
+
+Using file upload (via API):
+```python
+# Read file binary
+with open("document.pdf", "rb") as f:
+    binary = f.read()
+
+result = service.split_file(
+    filename="document.pdf",
+    binary=binary,
+    parser_id="book",
+    config={"chunk_token_num": 512}
+)
+```
+
+**Configuration Parameters for File Splitting**:
+- `chunk_token_num`: Target chunk size in tokens (default: 512)
+- `delimiter`: Delimiters for splitting large chunks (default: `"\n。.；;！!？？"`)
+- `lang`: Language for processing (default: `"Chinese"`)
+- `from_page`: Starting page number for PDF processing (default: 0)
+- `to_page`: Ending page number for PDF processing (default: 100000)
+- `max_file_size`: Maximum file size for URL downloads in bytes (file URL only)
+- `download_timeout`: Download timeout in seconds for file URLs (file URL only)
+- `head_request_timeout`: HEAD request timeout in seconds for file URLs (file URL only)
+
+The file splitting methods return chunks as a list of strings, along with metadata including the parser ID, total chunk count, and filename.
+[Source](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386)
+
 ### Handling Markdown Elements
 
 Markdown elements, especially images, are carefully preserved during chunking. In smart chunking, image nodes are reconstructed using their `alt` and `src` attributes to produce the correct markdown syntax (`![alt](src)`). This ensures that image source links are not lost during chunking, addressing previous bugs where image sources were dropped in smart chunks [PR #11](https://github.com/oceanbase/powerrag/pull/11).
@@ -47,7 +121,9 @@
 
 ### Customizing and Extending Chunking Behavior
 
-Chunking behavior can be customized by selecting the appropriate `parser_id` (`title`, `regex`, `smart`) and configuring parameters such as:
+#### Text Splitting Configuration
+
+For text splitting with `split_text`, chunking behavior can be customized by selecting the appropriate `parser_id` (`title`, `regex`, `smart`) and configuring parameters such as:
 
 - `title_level`: Markdown header level for splitting (title-based chunking).
 - `chunk_token_num`: Target chunk size in tokens.
@@ -63,14 +139,37 @@
     config={"chunk_token_num": 256, "min_chunk_tokens": 64}
 )
 ```
+
+#### File Splitting Configuration
+
+For file splitting with `split_file` or `split_file_upload`, all ParserType methods are available. Configuration parameters vary by parser but commonly include:
+
+- `chunk_token_num`: Target chunk size in tokens (default: 512)
+- `delimiter`: Delimiters for splitting large chunks
+- `lang`: Language for processing
+- `from_page`, `to_page`: Page range for PDF processing
+- `max_file_size`, `download_timeout`, `head_request_timeout`: URL download settings
+
+Example using the `book` parser:
+```python
+result = service.split_file(
+    filename="/path/to/book.pdf",
+    parser_id="book",
+    config={"chunk_token_num": 512, "lang": "Chinese"}
+)
+```
 [Source](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386)
+
+### Extending Chunking Logic
+
+To extend text chunking logic, implement a new chunker function and register it in the `CHUNKER_FACTORY` mapping within `PowerRAGSplitService`. Ensure your chunker accepts a configuration dictionary and returns chunks in the expected format. You may also customize AST node handling to support additional markdown elements or protected regions.
+
+To add support for new file parsers, implement a chunking module following the pattern of existing modules in `rag/app` or `powerrag/app`, and register it in the `_file_chunker_factory` mapping during initialization.
+
+### Binary File Parsing
 
 The system also supports parsing binary files (PDF, Office documents, images, HTML) into markdown, returning the markdown content, images (as base64), and metadata. The parsing configuration can specify layout recognition engines, formula and table recognition, and page ranges for PDFs [PR #40](https://github.com/oceanbase/powerrag/pull/40).
 
-### Extending Chunking Logic
-
-To extend chunking logic, implement a new chunker function and register it in the `CHUNKER_FACTORY` mapping within `PowerRAGSplitService`. Ensure your chunker accepts a configuration dictionary and returns chunks in the expected format. You may also customize AST node handling to support additional markdown elements or protected regions.
-
 ---
 
-For further details, refer to the [split_service.py implementation](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386) and relevant [pull requests](https://github.com/oceanbase/powerrag/pull/11).
+For further details, refer to the [split_service.py implementation](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386) and relevant pull requests: [PR #11](https://github.com/oceanbase/powerrag/pull/11), [PR #40](https://github.com/oceanbase/powerrag/pull/40), [PR #51](https://github.com/oceanbase/powerrag/pull/51).
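
The "Extending Chunking Logic" guidance in the diff above (implement a chunker that accepts a configuration dictionary, then register it in a factory mapping) can be sketched as below. This is only an illustration under stated assumptions: the real `CHUNKER_FACTORY` in `split_service.py` is not shown on this page, so `CHUNKER_REGISTRY`, `register_chunker`, `paragraph_chunker`, `split_with`, and the `max_chars` option are all hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, List, Optional

# Assumed shape: a chunker takes text plus a config dict and returns strings.
Chunker = Callable[[str, Dict[str, Any]], List[str]]

# Hypothetical stand-in for a factory mapping keyed by parser ID.
CHUNKER_REGISTRY: Dict[str, Chunker] = {}


def register_chunker(parser_id: str) -> Callable[[Chunker], Chunker]:
    """Decorator that stores a chunker under the given parser ID."""
    def decorator(func: Chunker) -> Chunker:
        CHUNKER_REGISTRY[parser_id] = func
        return func
    return decorator


@register_chunker("paragraph")
def paragraph_chunker(text: str, config: Dict[str, Any]) -> List[str]:
    """Greedily merge blank-line-separated paragraphs up to max_chars."""
    max_chars = config.get("max_chars", 200)  # illustrative option
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_with(parser_id: str, text: str,
               config: Optional[Dict[str, Any]] = None) -> List[str]:
    """Look up a registered chunker and run it, failing clearly if missing."""
    try:
        chunker = CHUNKER_REGISTRY[parser_id]
    except KeyError:
        raise ValueError(f"No chunker registered for {parser_id!r}") from None
    return chunker(text, config or {})
```

Keeping the lookup in one place means a new parser only needs a function and a registration line; the same idea would apply to a file chunker mapping, with modules instead of functions.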

PowerRAG SDK


@@ -9,6 +9,7 @@
 - Parse documents to Markdown format, including direct binary parsing for PDF, Office documents, images, and HTML ([source](https://github.com/oceanbase/powerrag/pull/40)).
 - Asynchronous and synchronous document parsing, with status polling and cancellation.
 - Manage document metadata, download content, and handle document chunks.
+- Split text and files into chunks using various parser methods, including support for local files, file URLs, and file uploads.
 
 ### Knowledge Base Management
 - Create, update, list, and delete chat sessions and agents.
@@ -104,6 +105,20 @@
 - `use_kg`: Use knowledge graph (bool)
 - `toc_enhance`: Table of contents enhancement (bool)
 
+### File Splitting
+- `parser_id`: Parser method ID (str)
+  - For text splitting (`split_text`): Only supports `title`, `regex`, `smart`
+  - For file splitting (`split_file`, `split_file_upload`): Supports all ParserType methods including `naive`, `title`, `regex`, `smart`, `qa`, `book`, `laws`, `paper`, `manual`, `presentation`, `table`, `resume`, `picture`, `one`, `email`
+- `chunk_token_num`: Target chunk size in tokens (int, default 512)
+- `delimiter`: Delimiter string (str, default `"\n。.；;！!？？"`)
+- `lang`: Language (str, default `"Chinese"`)
+- `from_page`, `to_page`: Page range for PDFs (int, default 0 and 100000)
+- `file_path`: Local file path (str, optional, for `split_file`)
+- `file_url`: Remote file URL (str, optional, for `split_file`)
+- `max_file_size`: Maximum file size in bytes for URL downloads (int, optional, default 128MB)
+- `download_timeout`: Download timeout in seconds (int, optional, default 300)
+- `head_request_timeout`: HEAD request timeout in seconds (int, optional, default 30)
+
 ## Usage Examples
 
 ### Create a Dataset and Upload Documents
@@ -189,12 +204,82 @@
     print(chunk.content)
 ```
 
+### Split Text into Chunks
+```python
+# Text splitting only supports: title, regex, smart
+result = client.chunk.split_text(
+    text="# Chapter 1\n\nThis is the content of chapter 1.\n\n# Chapter 2\n\nThis is chapter 2.",
+    parser_id="title",  # Only: title, regex, or smart
+    config={"chunk_token_num": 512}
+)
+
+print(f"Total chunks: {result['total_chunks']}")
+for chunk in result['chunks']:
+    print(chunk)  # chunks are strings
+```
+
+### Split Files into Chunks
+```python
+# Method 1: Split file from local path (server must have access to the path)
+result = client.chunk.split_file(
+    file_path="/path/to/document.pdf",
+    parser_id="book",  # Supports all ParserType methods
+    config={
+        "chunk_token_num": 512,
+        "delimiter": "\n。.；;！!？？",
+        "lang": "Chinese",
+        "from_page": 0,
+        "to_page": 100
+    }
+)
+
+# Method 2: Split file from URL
+result = client.chunk.split_file(
+    file_url="https://example.com/document.pdf",
+    parser_id="naive",
+    config={
+        "chunk_token_num": 256,
+        "max_file_size": 128 * 1024 * 1024,  # 128MB
+        "download_timeout": 300,
+        "head_request_timeout": 30
+    }
+)
+
+# Method 3: Upload file and split
+result = client.chunk.split_file_upload(
+    file_path="/path/to/local/document.pdf",
+    parser_id="book",
+    config={"chunk_token_num": 512}
+)
+
+print(f"Total chunks: {result['total_chunks']}")
+print(f"Filename: {result['filename']}")
+print(f"Parser used: {result['parser_id']}")
+for chunk in result['chunks']:
+    print(chunk)  # chunks are strings
+```
+
+**Supported ParserType methods for file splitting:**
+- Basic: `naive`, `title`, `regex`, `smart`
+- Professional: `qa`, `book`, `laws`, `paper`, `manual`, `presentation`
+- Special formats: `table`, `resume`, `picture`, `one`, `email`
+
+**Return value structure:**
+```python
+{
+    "parser_id": "book",
+    "chunks": ["chunk1", "chunk2", ...],  # List of strings
+    "total_chunks": 10,
+    "filename": "document.pdf"
+}
+```
+
 ## Integration Guidelines
 1. Install the SDK via pip.
 2. Import `PowerRAGClient` from `powerrag.sdk`.
 3. Initialize the client with your API key and server URL.
-4. Use resource objects (`dataset`, `document`, `chat`, `agent`) and their methods for all operations.
-5. Configure advanced options as needed for parsing, retrieval, and chat/agent creation.
+4. Use resource objects (`dataset`, `document`, `chat`, `agent`, `chunk`) and their methods for all operations.
+5. Configure advanced options as needed for parsing, retrieval, file splitting, and chat/agent creation.
 6. Handle exceptions as raised by SDK methods for error management.
 7. Refer to type annotations and docstrings for IDE assistance.
 
@@ -208,3 +293,69 @@
 
 ## License
 PowerRAG SDK is licensed under Apache-2.0 ([source](https://github.com/oceanbase/powerrag/pull/27)).
+
+## Frequently Asked Questions
+
+### What is the difference between text splitting and file splitting methods?
+
+The SDK provides three different methods for chunking content:
+
+**`split_text`**: Text-only splitting
+- Only supports three parser methods: `title`, `regex`, `smart`
+- Designed for plain text or Markdown content
+- No file handling required
+
+**`split_file`**: File splitting via path or URL
+- Supports all ParserType methods (15+ parsers)
+- Can process files from local paths (`file_path`) or remote URLs (`file_url`)
+- Server must have access to local paths when using `file_path`
+
+**`split_file_upload`**: Upload and split
+- Supports all ParserType methods (15+ parsers)
+- Uploads file from local system to server before splitting
+- Best for local files when server doesn't have direct access
+
+**When to use each method:**
+
+Use `split_text` when:
+- You have plain text or Markdown content
+- You only need `title`, `regex`, or `smart` parsers
+- You don't have a file to process
+
+Use `split_file` when:
+- You need parsers other than `title`, `regex`, or `smart` (e.g., `book`, `qa`, `naive`, `paper`)
+- The file is accessible via a URL
+- The file is on the server's filesystem (accessible via `file_path`)
+
+Use `split_file_upload` when:
+- You need parsers other than `title`, `regex`, or `smart`
+- The file is on your local machine
+- The server doesn't have direct access to the file path
+
+**Examples:**
+
+```python
+# Text splitting (only title, regex, smart)
+result = client.chunk.split_text(
+    text="# Chapter 1\n\nContent...",
+    parser_id="title"
+)
+
+# File splitting from local path
+result = client.chunk.split_file(
+    file_path="/server/path/doc.pdf",
+    parser_id="book"  # Can use any parser
+)
+
+# File splitting from URL
+result = client.chunk.split_file(
+    file_url="https://example.com/doc.pdf",
+    parser_id="naive"
+)
+
+# Upload and split
+result = client.chunk.split_file_upload(
+    file_path="/local/path/doc.pdf",
+    parser_id="qa"
+)
+```
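
The "when to use each method" guidance in the FAQ above boils down to a small decision rule. The sketch below encodes it as a plain function; `choose_split_method` and its `server_can_read_path` flag are hypothetical helpers for illustration and are not part of the SDK.

```python
from typing import Optional

# Parsers accepted by split_text, per the FAQ above.
TEXT_ONLY_PARSERS = {"title", "regex", "smart"}


def choose_split_method(
    parser_id: str,
    *,
    text: Optional[str] = None,
    file_url: Optional[str] = None,
    file_path: Optional[str] = None,
    server_can_read_path: bool = False,
) -> str:
    """Return the name of the SDK method the FAQ recommends for this input."""
    if text is not None:
        if parser_id not in TEXT_ONLY_PARSERS:
            raise ValueError(
                f"{parser_id!r} is not available for split_text; "
                "put the content in a file and use a file splitting method."
            )
        return "split_text"
    if file_url is not None:
        # Remote files go through split_file with file_url.
        return "split_file"
    if file_path is not None:
        # A path only works with split_file if the server can open it;
        # otherwise the file has to be uploaded first.
        return "split_file" if server_can_read_path else "split_file_upload"
    raise ValueError("Provide one of text, file_url, or file_path.")
```

The returned name can then be resolved on the client's `chunk` resource, for example with `getattr(client.chunk, name)`, assuming the client object from the examples above.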


Copilot

Pull request overview

This PR implements file splitting functionality for PowerRAG, enabling users to chunk documents via local paths, URLs, and uploads using various parser types. The changes enhance the existing text-only splitting capabilities by adding support for file-based parsing methods from the rag/app module.

Changes:

- Added `split_file` and `split_file_upload` methods to support file chunking through local paths, URLs, and direct uploads
- Enhanced error handling with more descriptive messages for unsupported parsers in text splitting
- Updated documentation with comprehensive examples distinguishing between text and file splitting methods
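
The three input paths listed above (local path, URL, direct upload) map onto the `filename` and `binary` arguments shown in the service examples earlier on this page. A minimal sketch of how such inputs could be told apart is given below; `classify_source` is a hypothetical helper, and treating only `http` and `https` as URLs is an assumption rather than something this PR states.

```python
from typing import Optional
from urllib.parse import urlparse


def classify_source(filename: str, binary: Optional[bytes] = None) -> str:
    """Classify a split_file-style input as 'upload', 'url', or 'local'."""
    if binary is not None:
        # Content was supplied directly; filename is only a label.
        return "upload"
    scheme = urlparse(filename).scheme.lower()
    if scheme in ("http", "https"):  # assumption: only web URLs are fetched
        return "url"
    # Anything else is treated as a path on the server's filesystem.
    return "local"
```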

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 18 comments.

Show a summary per file

| File | Description |
| --- | --- |
| powerrag/server/services/split_service.py | Adds `_init_file_chunker_factory` and `split_file` method to support file-based chunking with all parser types; improves error messages for unsupported text parsers |
| powerrag/server/routes/powerrag_routes.py | Implements `/split/file` and `/split/file/upload` endpoints; changes ConnectionError status code from 503 to 400 |
| powerrag/sdk/modules/chunk_manager.py | Adds `split_file` and `split_file_upload` client methods with support for both file paths and URLs |
| powerrag/sdk/tests/test_chunk.py | Adds tests for file splitting upload functionality and unsupported parser error handling; changes test parser from "naive" to "regex" |
| powerrag/sdk/tests/test_document.py | Adds sleep delays for async operation timing in cancel_parse test |
| powerrag/sdk/README.md | Comprehensive documentation updates clarifying split_text vs split_file usage, with examples for all three file splitting methods |
| api/apps/sdk/powerrag_proxy.py | Adds proxy endpoints for split_file operations; improves file handling with BytesIO wrapper for async file reading |
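
The last row mentions a BytesIO wrapper in the proxy. The proxy code itself is not shown on this page, so the following is only a sketch of the general pattern: bytes that have already been read from an upload are wrapped in `io.BytesIO` so that downstream code can treat them as a seekable file. `wrap_upload` is a hypothetical name.

```python
import io


def wrap_upload(data: bytes, filename: str) -> io.BytesIO:
    """Wrap already-read bytes in a seekable, named file-like object."""
    buffer = io.BytesIO(data)
    # Some HTTP clients read a `name` attribute to fill in the
    # multipart filename, so attach it to the buffer.
    buffer.name = filename
    return buffer
```

Because the buffer lives in memory, it can be rewound with `seek(0)` and read again, which a one-shot upload stream usually cannot.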


Zhangg7723 and others added 24 commits December 31, 2025 16:55

feat:add PowerRAG SDK and API Proxy

352fa8f

Merge branch 'oceanbase:main' into powerrag_sdk_api

1f60049

Merge branch 'oceanbase:main' into powerrag_sdk_api

8b735e4

feat: add GitHub Actions workflow for Python package publishing and initial SDK configuration

f8d2bd5

chore: update GitHub Actions workflow for SDK publishing and refine package configuration

35ba7e2

chore: update Python version requirement in pyproject.toml to support 3.10

6325328

chore: add environment configuration for PyPI in GitHub Actions workflow

79d1294

Merge branch 'oceanbase:main' into powerrag_sdk_api

3dbcd5d

docs: update SDK README.md

16a0245

Merge branch 'oceanbase:main' into powerrag_sdk_api

e160ce0

feat(document): add binary file parsing to Markdown method

f15d0ef

refactor(document_manager): centralize parse to markdown upload logic

7732022

refactor(init): remove module docstring and __all__ exports

33a0520

Merge branch 'oceanbase:main' into powerrag_sdk_api

23dde0b

chore(docker): add GOTENBERG server environment variables

399d17a

Update docker/.env.example

f28c272

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'oceanbase:main' into powerrag_sdk_api

d92c91d

feat(document): add input_type parameter for file type detection

e9780ed

feat(document): add support for file_url to parse documents from URL

045ad0a

Update powerrag/utils/file_utils.py

5cd36ad

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update powerrag/utils/file_utils.py

5635866

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

refactor(document): update README and code to use 'content' instead of 'markdown' for parsed results

2b3496a

Merge branch 'oceanbase:main' into powerrag_sdk_api

6a1564d

feat(sdk): implement file splitting functionality and enhance documentation

15d378b

dosubot bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files.) Jan 28, 2026

dosubot bot added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Jan 28, 2026

whhe requested a review from Copilot February 5, 2026 13:03

Copilot started reviewing on behalf of whhe February 5, 2026 13:03

Copilot AI reviewed Feb 5, 2026


dosubot bot added the size:XXL label (This PR changes 1000+ lines, ignoring generated files.) and removed the size:XL label (This PR changes 500-999 lines, ignoring generated files.) Feb 11, 2026

Zhangg7723 force-pushed the powerrag_sdk_api branch from ca05bc5 to 15d378b February 11, 2026 10:48

dosubot bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files.) and removed the size:XXL label (This PR changes 1000+ lines, ignoring generated files.) Feb 11, 2026

Zhangg7723 and others added 3 commits February 24, 2026 20:15

Update powerrag/server/services/split_service.py

537e40a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

feat(sdk): enhance file splitting functionality and improve error handling

0cb7666

Merge branch 'powerrag_sdk_api' of https://github.com/Zhangg7723/powerrag-github into powerrag_sdk_api

372e38d

whhe approved these changes Feb 25, 2026


whhe merged commit fd6e540 into oceanbase:main Feb 25, 2026
3 checks passed

whhe merged 27 commits into oceanbase:main from Zhangg7723:powerrag_sdk_api

3 participants
